Sciunits: Reusable Research Objects
Science is conducted collaboratively, often requiring knowledge sharing about
computational experiments. When experiments include only datasets, they can be
shared using Uniform Resource Identifiers (URIs) or Digital Object Identifiers
(DOIs). An experiment, however, seldom includes only datasets, but more often
includes software, its past execution, provenance, and associated
documentation. The Research Object has recently emerged as a comprehensive and
systematic method for aggregation and identification of diverse elements of
computational experiments. While a necessary method, mere aggregation is not
sufficient for the sharing of computational experiments. Other users must be
able to easily recompute on these shared research objects. In this paper, we
present the sciunit, a reusable research object in which aggregated content is
recomputable. We describe a Git-like client that efficiently creates, stores,
and repeats sciunits. We show through analysis that sciunits repeat
computational experiments with minimal storage and processing overhead.
Finally, we provide an overview of sharing and reproducible cyberinfrastructure
based on sciunits gaining adoption in the domain of geosciences.
Utilizing Provenance in Reusable Research Objects
Science is conducted collaboratively, often requiring the sharing of
knowledge about computational experiments. When experiments include only
datasets, they can be shared using Uniform Resource Identifiers (URIs) or
Digital Object Identifiers (DOIs). An experiment, however, seldom includes only
datasets, but more often includes software, its past execution, provenance, and
associated documentation. The Research Object has recently emerged as a
comprehensive and systematic method for aggregation and identification of
diverse elements of computational experiments. While a necessary method, mere
aggregation is not sufficient for the sharing of computational experiments.
Other users must be able to easily recompute on these shared research objects.
Computational provenance is often the key to enable such reuse. In this paper,
we show how reusable research objects can utilize provenance to correctly
repeat a previous reference execution, to construct a subset of a research
object for partial reuse, and to reuse existing contents of a research object
for modified reuse. We describe two methods to summarize provenance that aid in
understanding the contents and past executions of a research object. The first
method obtains a process-view by collapsing low-level system information, and
the second method obtains a summary graph by grouping related nodes and edges
with the goal of obtaining a graph view similar to an application workflow. Through
detailed experiments, we show the efficacy and efficiency of our algorithms.
Comment: 25 page
The SDSS SkyServer, Public Access to the Sloan Digital Sky Server Data
The SkyServer provides Internet access to the public Sloan Digital Sky Survey
(SDSS) data for both astronomers and for science education. This paper
describes the SkyServer goals and architecture. It also describes our
experience operating the SkyServer on the Internet. The SDSS data is public and
well-documented so it makes a good test platform for research on database
algorithms and performance.
Comment: submitted for publication, original at
http://research.microsoft.com/scripts/pubs/view.asp?TR_ID=MSR-TR-2001-10
Adaptive Physical Design for Curated Archives
We introduce AdaptPD, an automated physical design tool that improves database performance by continuously monitoring changes in the workload and adapting the physical design to suit the incoming workload. Current physical design tools are offline and require specification of a representative workload. AdaptPD is "always on" and incorporates online algorithms which profile the incoming workload to calculate the relative benefit of transitioning to an alternative design. Efficient query and transition cost estimation modules allow AdaptPD to quickly decide between various design configurations. We evaluate AdaptPD with the SkyServer Astronomy database using queries submitted by SkyServer's users. Experiments show that AdaptPD adapts to changes in the workload, improves query performance substantially over offline tools, and introduces minor computational overhead.
The Second Data Release of the Sloan Digital Sky Survey
The Sloan Digital Sky Survey (SDSS) has validated and made publicly available its Second Data Release. This data release consists of 3324 deg² of five-band (ugriz) imaging data with photometry for over 88 million unique objects, 367,360 spectra of galaxies, quasars, stars, and calibrating blank sky patches selected over 2627 deg² of this area, and tables of measured parameters from these data. The imaging data reach a depth of r ≈ 22.2 (95% completeness limit for point sources) and are photometrically and astrometrically calibrated to 2% rms and 100 mas rms per coordinate, respectively. The imaging data have all been processed through a new version of the SDSS imaging pipeline, in which the most important improvement since the last data release is fixing an error in the model fits to each object. The result is that model magnitudes are now a good proxy for point-spread function magnitudes for point sources, and Petrosian magnitudes for extended sources. The spectroscopy extends from 3800 to 9200 Å at a resolution of 2000. The spectroscopic software now repairs a systematic error in the radial velocities of certain types of stars and has substantially improved spectrophotometry. All data included in the SDSS Early Data Release and First Data Release are reprocessed with the improved pipelines and included in the Second Data Release. Further characteristics of the data are described, as are the data products themselves and the tools for accessing them.
The First Data Release of the Sloan Digital Sky Survey
The Sloan Digital Sky Survey has validated and made publicly available its
First Data Release. This consists of 2099 square degrees of five-band (u, g, r,
i, z) imaging data, 186,240 spectra of galaxies, quasars, stars and calibrating
blank sky patches selected over 1360 square degrees of this area, and tables of
measured parameters from these data. The imaging data go to a depth of r ~ 22.6
and are photometrically and astrometrically calibrated to 2% rms and 100
milli-arcsec rms per coordinate, respectively. The spectra cover the range
3800–9200 Å, with a resolution of 1800–2100. Further characteristics of the
data are described, as are the data products themselves.
Comment: Submitted to The Astronomical Journal. 16 pages. For associated
documentation, see http://www.sdss.org/dr
Estimating query result sizes for proxy caching in scientific database federations
In a proxy cache for federations of scientific databases, it is important to estimate the size of a query result before making a caching decision. With accurate estimates, near-optimal cache performance can be obtained. At the other extreme, inaccurate estimates can render the cache totally ineffective. We present classification and regression over templates (CAROT), a general method for estimating query result sizes, which is suited to the resource-limited environment of proxy caches and the distributed nature of database federations. CAROT estimates query result sizes by learning the distribution of query results, not by examining or sampling data, but from observing workload. We have integrated CAROT into the proxy cache of the National Virtual Observatory (NVO) federation of astronomy databases. Experiments conducted in the NVO show that CAROT dramatically outperforms conventional estimation techniques and provides near-optimal cache performance.